
Exploratory Data Analysis Lab

Estimated time needed: 30 minutes

In this module you get to work with the cleaned dataset from the previous module.

In this assignment you will perform exploratory data analysis: you will examine how the data is distributed, check for outliers, and determine the correlation between columns in the dataset.

Objectives

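In this lab you will perform the following:

Identify the distribution of data in the dataset.

Identify outliers in the dataset.

Remove outliers from the dataset.

Identify correlation between columns in the dataset.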


Hands-on Lab

Import the pandas module.

Load the dataset into a dataframe.
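A minimal sketch of the two steps above. The lab's dataset location is not given here, so the example reads a few made-up rows from an in-memory string; the rows and the `Respondent` column are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file. In the lab, pass the dataset's
# path or URL to pd.read_csv instead of this in-memory text.
csv_text = """Respondent,ConvertedComp,Age,Gender
1,61000,22,Man
2,95179,23,Man
3,90000,28,Woman
4,,31,Man
5,455352,41,Woman
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # number of rows and columns loaded
print(df.dtypes)  # confirm the numeric columns were parsed as numbers
print(df.head())  # first few rows
```

Checking `dtypes` right after loading is a cheap way to catch numeric columns that were read as text.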

Distribution

Determine how the data is distributed

The column ConvertedComp contains each respondent's salary converted to an annual USD figure using the exchange rate on 2019-02-01.

The conversion assumes 12 working months and 50 working weeks per year.

Plot the distribution curve for the column ConvertedComp.

Plot the histogram for the column ConvertedComp.
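One possible way to draw both plots, sketched with made-up salaries in place of the real column. It assumes matplotlib is installed, and that SciPy is available, since pandas relies on it for the density estimate. The bin count of 20 is an arbitrary choice.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up salaries standing in for df["ConvertedComp"].
comp = pd.Series(
    [25000, 40000, 52000, 57000, 61000, 75000, 90000, 120000, 200000, 1000000],
    name="ConvertedComp",
)

fig, (ax_kde, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Distribution curve: a kernel density estimate of the salaries.
comp.dropna().plot.kde(ax=ax_kde)
ax_kde.set_title("Distribution curve")
ax_kde.set_xlabel("ConvertedComp (USD)")

# Histogram: number of respondents falling in each salary bin.
comp.dropna().plot.hist(bins=20, ax=ax_hist)
ax_hist.set_title("Histogram")
ax_hist.set_xlabel("ConvertedComp (USD)")

fig.tight_layout()
# In a notebook, the figure is displayed automatically or via plt.show().
```

With a long right tail like this one, most bars crowd against the left edge, which is itself a hint that the column contains outliers.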

What is the median of the column ConvertedComp?
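As a small sketch with made-up values, `Series.median()` answers this directly and skips missing values by default. The mean is printed alongside only to show how strongly a single extreme salary can pull it.

```python
import pandas as pd

# Made-up salaries standing in for df["ConvertedComp"]; None marks a missing value.
comp = pd.Series([40000, 52000, None, 61000, 90000, 1000000], name="ConvertedComp")

median_comp = comp.median()  # missing values are ignored by default
mean_comp = comp.mean()

print(median_comp)  # 61000.0
print(mean_comp)    # 248600.0
```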

How many respondents identified themselves only as a Man?

Find the median ConvertedComp of respondents who identified themselves only as a Woman.
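The two gender questions can be sketched with an exact string match on a `Gender` column. The rows below are made up, and the `Man;Woman` value is an assumed encoding for a respondent who selected more than one option; check the actual values with `df["Gender"].value_counts()` before relying on it.

```python
import pandas as pd

# Made-up rows; the real dataset's Gender values may be encoded differently.
df = pd.DataFrame(
    {
        "Gender": ["Man", "Woman", "Man", "Man;Woman", "Woman", "Woman"],
        "ConvertedComp": [61000, 57000, 95000, 70000, 63000, None],
    }
)

# An exact match keeps only respondents whose sole answer was "Man".
only_man = (df["Gender"] == "Man").sum()

# Median salary of respondents whose sole answer was "Woman".
median_woman = df.loc[df["Gender"] == "Woman", "ConvertedComp"].median()

print(only_man)      # 2
print(median_woman)  # 60000.0
```

An exact comparison is used rather than `str.contains`, because a substring test would also count multi-option answers.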

Give the five-number summary for the column Age.
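The five-number summary is the minimum, first quartile, median, third quartile, and maximum. A minimal sketch with made-up ages, using pandas' default linear interpolation for the quartiles:

```python
import pandas as pd

# Made-up ages standing in for df["Age"].
age = pd.Series([19, 22, 25, 29, 33, 40, 58], name="Age")

# Quantiles 0, 0.25, 0.5, 0.75, and 1 give the five numbers in order.
summary = age.quantile([0, 0.25, 0.5, 0.75, 1.0])
summary.index = ["min", "Q1", "median", "Q3", "max"]

print(summary)
# min       19.0
# Q1        23.5
# median    29.0
# Q3        36.5
# max       58.0
```

`age.describe()` reports the same five values together with the count, mean, and standard deviation.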

Plot a histogram of the column Age.

Outliers

Finding outliers

Use a box plot to find out whether outliers exist in the column ConvertedComp.
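A sketch of the box plot using matplotlib directly, again with made-up salaries. By default matplotlib draws the whiskers at 1.5 times the interquartile range and marks every point beyond them as a flier, so any plotted flier is a candidate outlier.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up salaries standing in for df["ConvertedComp"].
comp = pd.Series(
    [25000, 40000, 52000, 57000, 61000, 75000, 90000, 120000, 200000, 1000000],
    name="ConvertedComp",
)

fig, ax = plt.subplots(figsize=(4, 6))
box = ax.boxplot(comp.dropna().to_numpy())
ax.set_ylabel("ConvertedComp (USD)")
ax.set_title("Box plot of ConvertedComp")

# Points drawn beyond the whiskers are the values flagged as outliers.
fliers = box["fliers"][0].get_ydata()
print(fliers)
```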

Find the interquartile range (IQR) for the column ConvertedComp.

Find out the upper and lower bounds.
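The lab does not state which rule defines the bounds; the sketch below assumes the common convention of 1.5 times the IQR below the first quartile and above the third quartile. The salaries are made up.

```python
import pandas as pd

# Made-up salaries standing in for df["ConvertedComp"].
comp = pd.Series(
    [25000, 40000, 52000, 57000, 61000, 75000, 90000, 120000, 200000, 1000000],
    name="ConvertedComp",
)

q1 = comp.quantile(0.25)
q3 = comp.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Assumed convention: values more than 1.5 * IQR outside the quartiles are outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print(iqr)          # 59250.0
print(lower_bound)  # -35625.0
print(upper_bound)  # 201375.0
```

A negative lower bound simply means that no salary in this sample can be a low-side outlier.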

Identify how many outliers there are in the ConvertedComp column.

Create a new dataframe by removing the outliers from the ConvertedComp column.
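Counting and removing outliers can be sketched with a boolean mask, again assuming the 1.5 times IQR convention. The rows are made up, and keeping rows with a missing salary is a choice made for this illustration, not something the lab prescribes.

```python
import pandas as pd

# Made-up rows; None marks a respondent with no reported salary.
df = pd.DataFrame(
    {
        "Respondent": range(1, 12),
        "ConvertedComp": [25000, 40000, 52000, 57000, 61000, 75000,
                          90000, 120000, 200000, 1000000, None],
    }
)

q1 = df["ConvertedComp"].quantile(0.25)
q3 = df["ConvertedComp"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() is False for missing values, so track them separately.
in_range = df["ConvertedComp"].between(lower_bound, upper_bound)
is_missing = df["ConvertedComp"].isna()

n_outliers = (~in_range & ~is_missing).sum()

# New dataframe without the outliers; rows with a missing salary are kept.
df_no_outliers = df[in_range | is_missing].copy()

print(n_outliers)           # 1
print(len(df_no_outliers))  # 10
```

Use `df[in_range]` instead if rows without a salary should be dropped as well.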

Correlation

Finding correlation

Find the correlation between Age and all other numerical columns.
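A minimal sketch of this step: restrict the dataframe to numeric columns, compute the pairwise Pearson correlations, and read off the `Age` column. The values are made up, and `WorkWeekHrs` is a hypothetical second numeric column added only so there is more than one result.

```python
import pandas as pd

# Made-up rows; WorkWeekHrs is a hypothetical numeric column for illustration.
df = pd.DataFrame(
    {
        "Age": [22, 25, 31, 38, 45, 52],
        "ConvertedComp": [40000, 52000, 61000, 75000, 90000, 120000],
        "WorkWeekHrs": [45, 40, 42, 40, 38, 35],
        "Gender": ["Man", "Woman", "Man", "Woman", "Man", "Woman"],
    }
)

# Keep numeric columns only, so text columns such as Gender are excluded.
numeric = df.select_dtypes(include="number")

# Pearson correlation of Age with every other numeric column.
age_corr = numeric.corr()["Age"].drop("Age").sort_values(ascending=False)

print(age_corr)
```

Values near 1 or -1 indicate a strong linear relationship with Age, and values near 0 indicate little or none.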

Authors

Ramesh Sannareddy

Other Contributors

Rav Ahuja

Change Log

Date (YYYY-MM-DD) | Version | Changed By | Change Description
2020-10-17 | 0.1 | Ramesh Sannareddy | Created initial version of the lab

Copyright © 2020 IBM Corporation. This notebook and its source code are released under the terms of the MIT License.